Introduction to Exploratory Data Analysis and Applied Statistical Techniques

Module 01

Ray J. Hoobler

Data Visualization

Visualizations Are Not New

1977

“The simple graph has brought more information to the data analyst’s mind than any other device.”

—John Tukey

Exploratory Data Analysis by John Tukey (Tukey 1977) is now considered a classic in the field of data analysis and statistics.

Four chapters are devoted to Graphic Presentation in my copy of Applied General Statistics (Croxton and Cowden 1946). (The book was first published in 1939.)

R for Data Science (2e)

2023

R for Data Science is an introduction to data manipulation and visualization. The authors are proponents of the tidyverse, a collection of R packages designed for data science, and of ggplot2. The tidyverse approach contrasts with base R, the functionality that ships with R itself.

The tidyverse provides an integrated framework that allows beginners to quickly get up to speed with data manipulation.
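
As a rough illustration of the contrast with base R, the sketch below computes the same grouped mean twice: once with base R's aggregate() and once with dplyr, one of the tidyverse packages. The small `birds` data frame and its values are invented for this example, and the native pipe `|>` assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
birds <- data.frame(
  species = c("A", "A", "B", "B"),
  mass_g  = c(10, 20, 30, 50)
)

# Base R: formula interface to aggregate()
base_result <- aggregate(mass_g ~ species, data = birds, FUN = mean)

# tidyverse (dplyr): pipe the data through group_by() and summarize()
tidy_result <- birds |>
  group_by(species) |>
  summarize(mass_g = mean(mass_g))

base_result
tidy_result
```

Both approaches return one row per species with the same means. The tidyverse version reads as a sequence of steps, which is part of why beginners often find it easier to follow.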

ggplot2 is a plotting system for R based on the grammar of graphics. Once you become familiar with ggplot2, you will recognize its presence in many publications. A Layered Grammar of Graphics (Wickham 2010) provides the philosophical framework for ggplot2.

Prerequisites

Before you begin any readings, you should have R and RStudio installed on your computer.

Follow the instructions on the Posit.co website for installing the RStudio IDE (integrated development environment).

  1. Install R from the Rstudio.com mirror of the CRAN website.
  2. Install RStudio from Posit.co.
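
Once R is installed, the packages loaded later in this module also need to be installed once. The sketch below only checks which of them are missing and prints a suggested install.packages() call; it does not install anything itself. The package list is taken from the library() calls in this module, and the variable names are placeholders.

```r
# Packages loaded with library() elsewhere in this module
needed <- c("tidyverse", "palmerpenguins", "ggthemes")

# requireNamespace() returns TRUE when a package is installed and can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Print the command to run once in the console
  message(
    "Run once: install.packages(c(",
    paste0('"', missing_pkgs, '"', collapse = ", "),
    "))"
  )
} else {
  message("All packages for this module are installed.")
}
```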

Getting Started

Once you have R and RStudio installed, start RStudio and type library(tidyverse) in the console.

Code
library(tidyverse)


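The first time you load the package in a session, R prints a startup message listing the core tidyverse packages that were attached and any function-name conflicts with other packages.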

The Palmer Penguins Dataset

The Palmer Penguins dataset is popular for learning data visualization. It is bundled with the palmerpenguins package, was created by Allison Horst, Alison Hill, and Kristen Gorman, and is available on GitHub.

Code
library(palmerpenguins)

Data Frames

Data frames will be the default data structure we use in this course. Data frames should look familiar to anyone who has used spreadsheets.

Code
penguins

Variables are in columns and observations are in rows.
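
A few functions make the row and column structure explicit. This is a minimal sketch, assuming the palmerpenguins and dplyr packages are installed; glimpse() comes from dplyr, which is loaded here directly rather than through the whole tidyverse.

```r
library(palmerpenguins)
library(dplyr)

dim(penguins)      # number of rows (observations), then columns (variables)
names(penguins)    # the variable names, one per column
glimpse(penguins)  # one line per variable: its type and first few values
```

For this dataset, dim() reports 344 observations of 8 variables.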

“Ultimate goal” for Chapter 1 in R for Data Science

Code
library(ggthemes)

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  scale_color_colorblind()

Creating a ggplot: Step 1

Code
ggplot(data = penguins)

Creating a ggplot: Step 2

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
)

Creating a ggplot: Step 3

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) + 
  geom_point()

Warning

Warning: Removed 2 rows containing missing values or values outside the scale range (geom_point()).
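
The warning can be checked directly against the data. The sketch below counts the rows where either plotted variable is missing; the name `missing_either` is a placeholder chosen for this example.

```r
library(palmerpenguins)

# TRUE for rows where either plotted variable is NA
missing_either <- is.na(penguins$flipper_length_mm) | is.na(penguins$body_mass_g)

sum(missing_either)         # how many rows geom_point() cannot draw
penguins[missing_either, ]  # inspect those rows
```

The count matches the two rows reported in the warning: ggplot2 drops them because a point with a missing x or y value cannot be placed.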

Creating a ggplot: Step 4

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + 
  geom_point()

Creating a ggplot: Step 5

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) + 
  geom_point() +
  geom_smooth(method = "lm")

Important

When aesthetic mappings are defined in the ggplot() function, they are inherited by all layers.

The aesthetic “color” is being applied to both the geom_point() and geom_smooth() layers.
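
One way to see the inheritance is to count how many groups the smoothing layer is fitted to. The sketch below builds the plot twice, once with color mapped in ggplot() and once with color mapped only in geom_point(), and uses ggplot2's layer_data() to pull out the data computed for the second layer. The helper `n_smooth_groups()` is a hypothetical name defined only for this example.

```r
library(ggplot2)
library(palmerpenguins)

# Color mapped globally: inherited by geom_point() and geom_smooth()
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

# Color mapped locally: applies to geom_point() only
p_local <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

# Hypothetical helper: number of groups in layer 2 (the geom_smooth() layer)
n_smooth_groups <- function(p) length(unique(layer_data(p, 2)$group))

n_smooth_groups(p_global)  # one fitted line per species
n_smooth_groups(p_local)   # a single fitted line for all penguins
```

With the global mapping, the smoothing layer is split into three groups, one per species. With the local mapping, it sees a single group. Both calls repeat the missing-values warning shown earlier.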

Creating a ggplot: Step 6

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  geom_smooth(method = "lm")

Creating a ggplot: Step 7

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm")

Creating a ggplot: Step 8

Code
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()

Module 1 Assignment 1

Create a new Quarto HTML document and answer questions 1 through 10 in the R for Data Science section: 1.2.5 Exercises.

Exploratory Data Analysis

NIST/SEMATECH e-Handbook of Statistical Methods

The NIST/SEMATECH e-Handbook of Statistical Methods is a collaborative project involving the National Institute of Standards and Technology (NIST) and SEMATECH.

NIST is a non-regulatory federal agency within the U.S. Department of Commerce. The main role of NIST is to promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology.

SEMATECH was a research consortium of semiconductor manufacturers and suppliers.

EDA Techniques

EDA Assumptions

End of Module 01

References

Croxton, Frederick E., and Dudley J. Cowden. 1946. Applied General Statistics. New York: Prentice-Hall.
Tukey, John Wilder. 1977. Exploratory Data Analysis. Addison-Wesley Series in Behavioral Science. Reading, Mass: Addison-Wesley Pub. Co.
Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28. https://doi.org/10.1198/jcgs.2009.07098.